[data][feat] add large dataset support #1298
base: main
Conversation
Force-pushed from 24fb033 to 7d855c6.
@codex review

To use Codex here, create a Codex account and connect it to GitHub.

@codex review
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 8ee19cc3fa
```python
# Store original content length before padding for correct unpacking
packed_sequence["content_length"] = seq_length
```
Preserve original `content_length` across CP padding
In CP mode, `_get_model_inputs_args` pads the same packed batch multiple times (e.g., once during `_compute_log_prob` and again during `_train_step`). `pad_packed_sequence_with_cp` overwrites `content_length` unconditionally, so on the second call it records the already-padded length. `unpack_sequences` then computes `pad_length` as 0 and slices `log_probs`/`entropy` for the last sequence using the padded end index, which drops real tokens and includes padding when `cp_size > 1`, corrupting the loss and metrics for that sequence. Consider setting `content_length` only when it is absent, or skipping the padding when the batch is already padded.
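A minimal sketch of the first option, reusing the function and key names from the comment above; the padding arithmetic is an illustrative assumption, not the actual slime implementation:

```python
import torch

def pad_packed_sequence_with_cp(packed_sequence: dict, cp_size: int) -> dict:
    # Sketch: pad the packed length to a multiple of 2 * cp_size.
    seq_length = packed_sequence["tokens"].shape[-1]
    # Record the original length only on the first padding pass; a repeated
    # call must not overwrite it with the already-padded length.
    packed_sequence.setdefault("content_length", seq_length)
    pad_length = (-seq_length) % (2 * cp_size)
    if pad_length:
        packed_sequence["tokens"] = torch.nn.functional.pad(
            packed_sequence["tokens"], (0, pad_length), value=0
        )
    return packed_sequence
```

With `setdefault`, the second padding pass inside `_train_step` still sees the length recorded during `_compute_log_prob`, so `unpack_sequences` can compute the true `pad_length`.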
PopSoda2002 left a comment:
Thanks for your great work! I am not sure whether we need to separate the dataset and SFT changes into different PRs; currently this PR seems too large to review.
slime/backends/fsdp_utils/actor.py (outdated)
```python
if self.args.calculate_per_token_loss:
    sft_loss = sum_of_token(sft_loss, response_lengths, loss_masks)
else:
    sft_loss = sum_of_sample_mean(sft_loss, response_lengths, loss_masks)
```
I think we just need to calculate per-sample loss in SFT?
After further discussion, we realized that SFT should only use per-token loss.
We’ll simplify this logic by keeping only the token loss here.
For users who still try to use sequence / per-sample loss in SFT, we’ll explicitly raise an error to avoid silent misconfiguration.
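A minimal sketch of that validation, assuming hypothetical flag names (`calculate_per_token_loss` appears in the diff above; `loss_type` is an assumption):

```python
from argparse import Namespace

def validate_sft_loss_config(args: Namespace) -> None:
    # Hypothetical guard: SFT must use per-token loss; fail fast instead of
    # silently training with per-sample/sequence normalization.
    if args.loss_type == "sft_loss" and not args.calculate_per_token_loss:
        raise ValueError(
            "SFT supports per-token loss only; "
            "enable calculate_per_token_loss or drop the SFT loss type."
        )

# Example: a misconfigured run fails at startup rather than mid-training.
validate_sft_loss_config(Namespace(loss_type="sft_loss", calculate_per_token_loss=True))
```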
slime/rollout/data_source.py (outdated)
```python
seed=self.args.rollout_seed,
apply_chat_template=self.args.apply_chat_template,
apply_chat_template_kwargs=self.args.apply_chat_template_kwargs,
dp_size=self._dp_size or 1,
```
We should not pass the DP size to the data source, because the data source is used in the rollout manager, which does not have DP ranks.
> We should not pass the DP size to the data source, because the data source is used in the rollout manager, which does not have DP ranks.
Apologies for the mistake, I’ll address it in the upcoming commits.
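For illustration only, a minimal sketch of the DP-agnostic direction implied here, with hypothetical names (not slime’s actual API): the rollout-side data source carries no data-parallel state, and sharding happens on the training side.

```python
class DataSource:
    # Sketch: the rollout manager owns this object, so it holds no DP state.
    def __init__(self, seed: int, apply_chat_template: bool = False):
        self.seed = seed
        self.apply_chat_template = apply_chat_template

def shard_for_rank(samples: list, dp_rank: int, dp_size: int) -> list:
    # Training side: each data-parallel rank takes a strided slice of the
    # global batch instead of the data source pre-sharding it.
    return samples[dp_rank::dp_size]
```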
I’ll work on some simplifications with @ChangyiYang today. Afterward, could you review it again and discuss whether we should split this PR into two?
Yeah, sure, definitely. Always willing to help!
Thank you for the efforts on radixark/miles#246 and the hard work contributed by @Ratish1!